pacman::p_load(tidyverse, lubridate, knitr, DT, ggplot2, plotly, ggthemes, ggfortify, forecast, MLmetrics, tsbox, xts, imputeTS, tseries, hrbrthemes, autoplotly)

Goh Si Hui

February 25, 2024
March 11, 2024
In this exercise, we will prepare the time-series forecasting module of the proposed Shiny application and complete the following tasks:
Evaluate and determine the R packages needed for my group project’s Shiny application;
Prepare and test that the specific R code runs and returns the correct output as expected;
Determine the parameters and outputs that will be exposed on the Shiny application; and
Select the appropriate Shiny UI components for exposing the parameters determined above.
Before we start, let us ensure that the required R packages have been installed and import the relevant data for this exercise.
For this exercise, we will be using the following packages:
tidyverse
xts
lubridate
tsbox
imputeTS
DT
ggplot2
plotly
ggthemes
hrbrthemes
forecast
MLmetrics
The code chunk above uses p_load() of the pacman package to check whether the packages are installed on the computer. If they are, they will be loaded into R; otherwise, pacman will install the relevant packages before loading them.
For this exercise, we will be using the historical daily weather records from weather.gov.sg. We retrieved the daily records from Jan 1980 to Dec 2023 via data.gov.sg’s API. The daily historical weather records are in csv file format.
We use the read_csv() function of readr to import the daily_historical csv file into R, then use glimpse() of dplyr to learn about the associated attribute information in the dataframe.
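The import step described above might look like the following sketch (the file path and the dataframe name daily_historical are assumptions; adjust them to your project layout):

```r
# import the raw daily weather records and inspect the columns
daily_historical <- read_csv("data/daily_historical.csv")
glimpse(daily_historical)
```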
There is no date column, but there are year, month and day columns. In addition, we note that R currently reads the year, month and day columns as numeric data.
There are separate columns for rainfall and temperature, so we will need to select and filter the relevant columns in subsequent steps.
The entire dataset daily_historical.csv is very large. We should save the filtered data into an R data format (RDS) so that we can easily retrieve it in future without importing the entire daily_historical.csv dataset again.
Let us first create a date column, called tdate, using paste(), mutate() and lubridate’s ymd() to convert the numeric year, month and day columns into a single column of date data type in year-month-day format.
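A minimal sketch of this step, assuming the imported dataframe is called daily_historical and has numeric year, month and day columns:

```r
# combine year/month/day into one string, then parse it into a Date column
daily_historical <- daily_historical %>%
  mutate(tdate = ymd(paste(year, month, day, sep = "-")))
```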
Now we will select the relevant temperature-related columns needed for this exercise using select(). We will examine the resulting dataframe temp using str().
temp is a tibble dataframe with the following columns:
station: the weather station that collected the temperature reading.
tdate: the date on which the temperature reading was collected.
mean_temperature: the mean temperature reading of that day.
maximum_temperature: the highest temperature reading of that day.
minimum_temperature: the lowest temperature reading of that day.
Let us save the dataframe into an RDS file for future use using write_rds().
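The selection and save steps can be sketched as follows (column names are taken from the glimpse() output earlier; the source dataframe name and RDS file path are assumptions):

```r
# keep only the station, date and temperature columns
temp <- daily_historical %>%
  select(station, tdate, mean_temperature,
         maximum_temperature, minimum_temperature)
str(temp)

# save to RDS so we can reload without re-importing the large csv
write_rds(temp, "data/temp.rds")
```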
We will bring in the temperature data using read_rds(). Let us check the imported RDS data using str() again.
tibble [329,156 × 5] (S3: tbl_df/tbl/data.frame)
$ station : chr [1:329156] "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" ...
$ tdate : Date[1:329156], format: "1980-01-01" "1980-01-02" ...
$ mean_temperature : num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
$ maximum_temperature: num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
$ minimum_temperature: num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
tdate is now read as a Date data type and the temperature-related columns as numeric. First, let us use summary() to get a sense of the missing data.
station tdate mean_temperature maximum_temperature
Length:329156 Min. :1980-01-01 Min. :22.20 Min. :22.80
Class :character 1st Qu.:1997-04-29 1st Qu.:27.10 1st Qu.:30.50
Mode :character Median :2011-09-18 Median :27.90 Median :31.70
Mean :2007-07-02 Mean :27.87 Mean :31.49
3rd Qu.:2017-11-27 3rd Qu.:28.80 3rd Qu.:32.60
Max. :2023-12-31 Max. :31.50 Max. :38.00
NA's :58 NA's :255645 NA's :255282
minimum_temperature
Min. :20.0
1st Qu.:24.3
Median :25.2
Mean :25.3
3rd Qu.:26.3
Max. :30.0
NA's :255283
The observations range from 1 Jan 1980 to 31 Dec 2023. There are 58 rows with missing dates. We should drop these rows since they cannot tell us which day the observations were made (even if they have temperature readings).
There are 255,645 rows of NAs in daily mean temperature.
There are 255,282 rows of NAs in daily maximum temperature.
There are 255,283 rows of NAs in daily minimum temperature.
We noted that there are a lot of missing values. As the aim of this task is to forecast future temperatures, missing value treatment is not so straight-forward. Imputation using mean, median & mode might hide trends or seasonal patterns whereas removing missing data points altogether might reduce information contained in other features. Let’s understand more about the missing values before we proceed to do imputation for missing values.
First, let us drop the rows where the date is missing, because we would not be able to definitively identify when the temperature readings were collected (even if there were temperature readings in these rows).
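A sketch of this filtering step, keeping only rows with a valid tdate and re-running summary():

```r
# drop observations with no date, then re-check the missing-value counts
temp <- temp %>%
  filter(!is.na(tdate))
summary(temp)
```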
station tdate mean_temperature maximum_temperature
Length:329098 Min. :1980-01-01 Min. :22.20 Min. :22.80
Class :character 1st Qu.:1997-04-29 1st Qu.:27.10 1st Qu.:30.50
Mode :character Median :2011-09-18 Median :27.90 Median :31.70
Mean :2007-07-02 Mean :27.87 Mean :31.49
3rd Qu.:2017-11-27 3rd Qu.:28.80 3rd Qu.:32.60
Max. :2023-12-31 Max. :31.50 Max. :38.00
NA's :255587 NA's :255224
minimum_temperature
Min. :20.0
1st Qu.:24.3
Median :25.2
Mean :25.3
3rd Qu.:26.3
Max. :30.0
NA's :255225
We noted that there are many weather stations in the temp dataframe. Hence, we will make use of plotly to further explore the missing temperatures.
Let us first explore the daily mean temperatures by selecting the relevant columns and pivoting the dataframe wider.
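The reshaping step might look like this, keeping only the date and mean temperature and using pivot_wider() so that each station becomes its own column (a sketch; the name temp_mean_wide matches its later use in the plotly code):

```r
# one row per date, one column per weather station
temp_mean_wide <- temp %>%
  select(tdate, station, mean_temperature) %>%
  pivot_wider(names_from = station, values_from = mean_temperature)
summary(temp_mean_wide)
```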
tdate Macritchie Reservoir Lower Peirce Reservoir
Min. :1980-01-01 Min. : NA Min. : NA
1st Qu.:1990-12-31 1st Qu.: NA 1st Qu.: NA
Median :2001-12-31 Median : NA Median : NA
Mean :2001-12-31 Mean :NaN Mean :NaN
3rd Qu.:2012-12-30 3rd Qu.: NA 3rd Qu.: NA
Max. :2023-12-31 Max. : NA Max. : NA
NA's :16071 NA's :16071
Admiralty East Coast Parkway Ang Mo Kio Newton
Min. :22.50 Min. :23.40 Min. :22.50 Min. :22.20
1st Qu.:26.80 1st Qu.:27.50 1st Qu.:26.90 1st Qu.:26.80
Median :27.60 Median :28.20 Median :27.80 Median :27.70
Mean :27.66 Mean :28.13 Mean :27.82 Mean :27.58
3rd Qu.:28.50 3rd Qu.:28.90 3rd Qu.:28.80 3rd Qu.:28.40
Max. :30.80 Max. :30.80 Max. :31.20 Max. :30.60
NA's :10821 NA's :10806 NA's :10918 NA's :11148
Lim Chu Kang Marine Parade Choa Chu Kang (Central) Tuas South
Min. : NA Min. : NA Min. : NA Min. :23.10
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.:27.40
Median : NA Median : NA Median : NA Median :28.20
Mean :NaN Mean :NaN Mean :NaN Mean :28.19
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:29.00
Max. : NA Max. : NA Max. : NA Max. :31.00
NA's :16071 NA's :16071 NA's :16071 NA's :11609
Pasir Panjang Jurong Island Nicoll Highway Botanic Garden
Min. :23.20 Min. :23.40 Min. : NA Min. : NA
1st Qu.:27.50 1st Qu.:27.60 1st Qu.: NA 1st Qu.: NA
Median :28.30 Median :28.40 Median : NA Median : NA
Mean :28.25 Mean :28.29 Mean :NaN Mean :NaN
3rd Qu.:29.10 3rd Qu.:29.10 3rd Qu.: NA 3rd Qu.: NA
Max. :31.30 Max. :31.10 Max. : NA Max. : NA
NA's :11049 NA's :11748 NA's :16071 NA's :16071
Choa Chu Kang (South) Whampoa Changi Jurong Pier
Min. :22.70 Min. : NA Min. :22.80 Min. : NA
1st Qu.:26.80 1st Qu.: NA 1st Qu.:26.90 1st Qu.: NA
Median :27.70 Median : NA Median :27.70 Median : NA
Mean :27.68 Mean :NaN Mean :27.69 Mean :NaN
3rd Qu.:28.60 3rd Qu.: NA 3rd Qu.:28.60 3rd Qu.: NA
Max. :31.00 Max. : NA Max. :30.90 Max. : NA
NA's :11558 NA's :16071 NA's :731 NA's :16071
Ulu Pandan Mandai Tai Seng Jurong (West)
Min. : NA Min. : NA Min. :23.2 Min. :22.20
1st Qu.: NA 1st Qu.: NA 1st Qu.:27.6 1st Qu.:26.60
Median : NA Median : NA Median :28.4 Median :27.40
Mean :NaN Mean :NaN Mean :28.4 Mean :27.41
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:29.3 3rd Qu.:28.30
Max. : NA Max. : NA Max. :31.5 Max. :30.60
NA's :16071 NA's :16071 NA's :11454 NA's :10995
Clementi Sentosa Island Bukit Panjang Kranji Reservoir
Min. :22.80 Min. :23.00 Min. : NA Min. : NA
1st Qu.:26.90 1st Qu.:27.40 1st Qu.: NA 1st Qu.: NA
Median :27.70 Median :28.20 Median : NA Median : NA
Mean :27.63 Mean :28.12 Mean :NaN Mean :NaN
3rd Qu.:28.50 3rd Qu.:28.90 3rd Qu.: NA 3rd Qu.: NA
Max. :30.60 Max. :31.10 Max. : NA Max. : NA
NA's :11258 NA's :11317 NA's :16071 NA's :16071
Upper Peirce Reservoir Kent Ridge Queenstown Tanjong Katong
Min. : NA Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA Max. : NA
NA's :16071 NA's :16071 NA's :16071 NA's :16071
Somerset (Road) Punggol Simei Toa Payoh
Min. : NA Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA Max. : NA
NA's :16071 NA's :16071 NA's :16071 NA's :16071
Tuas Bukit Timah Pasir Ris (Central)
Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA
NA's :16071 NA's :16071 NA's :16071
We will make use of plotly to explore the missing daily temperatures for each station using a dropdown list.
plot_ly(data = temp_mean_wide,
x = ~tdate,
y = ~ Admiralty,
type = "scatter",
mode = "lines") |>
layout(title = "Temperature observed by Weather Stations",
xaxis = list(title = "Date"),
yaxis = list(title = "", range = c(0,40)),
theme_ipsum_rc(plot_title_size = 13, plot_title_margin=4, subtitle_size=11, subtitle_margin=4,
axis_title_size = 8, axis_text_size=8, axis_title_face= "bold", plot_margin = margin(4, 4, 4, 4)),
updatemenus = list(list(type = 'dropdown',
xref = "paper",
yref = "paper",
xanchor = "left",
x = 0.04,
y = 0.95,
buttons = list(
list(method = "update",
args = list(list(y = list(temp_mean_wide$Admiralty)),
list(yaxis = list(title = "Temperature in Admiralty", range = c(0,40)))),label = "Admiralty"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`East Coast Parkway`)),
list(yaxis = list(title = "Temperature in East Coast Parkway", range = c(0,40)))),label = "East Coast Parkway"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Ang Mo Kio`)),
list(yaxis = list(title = "Temperature in Ang Mo Kio", range = c(0,40)))),label = "Ang Mo Kio"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Newton)),
list(yaxis = list(title = "Temperature in Newton", range = c(0,40)))),label = "Newton"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Tuas South`)),
list(yaxis = list(title = "Temperature in Tuas South", range = c(0,40)))),label = "Tuas South"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Pasir Panjang`)),
list(yaxis = list(title = "Temperature in Pasir Panjang", range = c(0,40)))),label = "Pasir Panjang"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Jurong Island`)),
list(yaxis = list(title = "Temperature in Jurong Island", range = c(0,40)))),label = "Jurong Island"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Choa Chu Kang (South)`)),
list(yaxis = list(title = "Temperature in Choa Chu Kang (South)", range = c(0,40)))),label = "Choa Chu Kang (South)"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Changi)),
list(yaxis = list(title = "Temperature in Changi", range = c(0,40)))),label = "Changi"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Tai Seng`)),
list(yaxis = list(title = "Temperature in Tai Seng", range = c(0,40)))),label = "Tai Seng"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Jurong (West)`)),
list(yaxis = list(title = "Temperature in Jurong West", range = c(0,40)))),label = "Jurong West"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Clementi)),
list(yaxis = list(title = "Temperature in Clementi", range = c(0,40)))),label = "Clementi"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Sentosa Island`)),
list(yaxis = list(title = "Temperature in Sentosa", range = c(0,40)))),label = "Sentosa"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Macritchie Reservoir`)),
list(yaxis = list(title = "Temperature at Macritchie Reservoir", range = c(0,40)))),label = "Macritchie Reservoir"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Lower Peirce Reservoir`)),
list(yaxis = list(title = "Temperature at Lower Peirce Reservoir", range = c(0,40)))),label = "Lower Peirce Reservoir"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Lim Chu Kang`)),
list(yaxis = list(title = "Temperature at Lim Chu Kang", range = c(0,40)))),label = "Lim Chu Kang"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Marine Parade`)),
list(yaxis = list(title = "Temperature at Marine Parade", range = c(0,40)))),label = "Marine Parade"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Choa Chu Kang (Central)`)),
list(yaxis = list(title = "Temperature at Choa Chu Kang (Central)", range = c(0,40)))),label = "Choa Chu Kang (Central)"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Nicoll Highway`)),
list(yaxis = list(title = "Temperature at Nicoll Highway", range = c(0,40)))),label = "Nicoll Highway"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Botanic Garden`)),
list(yaxis = list(title = "Temperature at Botanic Garden", range = c(0,40)))),label = "Botanic Garden"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Whampoa)),
list(yaxis = list(title = "Temperature at Whampoa", range = c(0,40)))),label = "Whampoa"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Jurong Pier`)),
list(yaxis = list(title = "Temperature at Jurong Pier", range = c(0,40)))),label = "Jurong Pier"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Ulu Pandan`)),
list(yaxis = list(title = "Temperature at Ulu Pandan", range = c(0,40)))),label = "Ulu Pandan"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Mandai)),
list(yaxis = list(title = "Temperature at Mandai", range = c(0,40)))),label = "Mandai"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Bukit Panjang`)),
list(yaxis = list(title = "Temperature at Bukit Panjang", range = c(0,40)))),label = "Bukit Panjang"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Kranji Reservoir`)),
list(yaxis = list(title = "Temperature at Kranji Reservoir", range = c(0,40)))),label = "Kranji Reservoir"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Upper Peirce Reservoir`)),
list(yaxis = list(title = "Temperature at Upper Peirce Reservoir", range = c(0,40)))),label = "Upper Peirce Reservoir"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Kent Ridge`)),
list(yaxis = list(title = "Temperature at Kent Ridge", range = c(0,40)))),label = "Kent Ridge"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$Queenstown)),
list(yaxis = list(title = "Temperature at Queenstown", range = c(0,40)))),label = "Queenstown"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Tanjong Katong`)),
list(yaxis = list(title = "Temperature at Tanjong Katong", range = c(0,40)))),label = "Tanjong Katong"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Somerset (Road)`)),
list(yaxis = list(title = "Temperature at Somerset (Road)", range = c(0,40)))),label = "Somerset (Road)"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Punggol`)),
list(yaxis = list(title = "Temperature at Punggol", range = c(0,40)))),label = "Punggol"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Simei`)),
list(yaxis = list(title = "Temperature at Simei", range = c(0,40)))),label = "Simei"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Toa Payoh`)),
list(yaxis = list(title = "Temperature at Toa Payoh", range = c(0,40)))),label = "Toa Payoh"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Tuas`)),
list(yaxis = list(title = "Temperature at Tuas", range = c(0,40)))),label = "Tuas"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Bukit Timah`)),
list(yaxis = list(title = "Temperature at Bukit Timah", range = c(0,40)))),label = "Bukit Timah"),
list(method = "update",
args = list(list(y = list(temp_mean_wide$`Pasir Ris (Central)`)),
list(yaxis = list(title = "Temperature at Pasir Ris (Central)", range = c(0,40)))),label = "Pasir Ris (Central)")
))))

It seems like there are some weather stations with no temperature data at all. We should remove them from the temp dataframe.
Some weather stations (e.g. Admiralty) have temperature data only from a certain year onwards (e.g. 2009), while other stations (e.g. Changi) have temperature data from as early as 1980.
Even the weather stations with temperature data have missing values over certain time periods, so we will need to decide how to handle these missing values in subsequent sections.
Let us identify the amount of missing values for each weather station using the following code chunk.
missing_values <- temp_mean_wide %>%
gather(key = "key", value = "val") %>%
mutate(isna = is.na(val)) %>%
group_by(key) %>%
mutate(total = n()) %>%
group_by(key, total, isna) %>%
summarise(num.isna = n()) %>%
mutate(pct = num.isna / total * 100)
levels <-
(missing_values %>% filter(isna == T) %>% arrange(desc(pct)))$key
percentage_plot <- missing_values %>%
ggplot() +
geom_bar(aes(x = reorder(key, desc(pct)),
y = pct, fill=isna),
stat = 'identity', alpha=0.8) +
scale_x_discrete(limits = levels) +
scale_fill_manual(name = "",
values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
coord_flip() +
labs(title = "Percentage of missing values", x =
'Variable', y = "% of missing values")
percentage_plot
The above output is consistent with what we observed when exploring the data with plotly. Numerous stations have no temperature readings at all, and certain stations have temperature readings only during certain time periods.
Let us find out which stations have no temperature readings throughout the entire time period using filter(). We will filter out those weather stations that have 100% NAs.
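One way to pull out the all-NA stations from the missing_values summary built earlier (a sketch; the 100% threshold follows from the pct column computed above):

```r
# stations whose mean-temperature column is entirely NA
missing_values %>%
  filter(isna == TRUE, pct == 100) %>%
  arrange(key) %>%
  pull(key)
```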
[1] "Botanic Garden" "Bukit Panjang"
[3] "Bukit Timah" "Choa Chu Kang (Central)"
[5] "Jurong Pier" "Kent Ridge"
[7] "Kranji Reservoir" "Lim Chu Kang"
[9] "Lower Peirce Reservoir" "Macritchie Reservoir"
[11] "Mandai" "Marine Parade"
[13] "Nicoll Highway" "Pasir Ris (Central)"
[15] "Punggol" "Queenstown"
[17] "Simei" "Somerset (Road)"
[19] "Tanjong Katong" "Toa Payoh"
[21] "Tuas" "Ulu Pandan"
[23] "Upper Peirce Reservoir" "Whampoa"
From the above output, we know that these 24 weather stations have no temperature readings. We will put them into a list and create an operator to exclude them from the temp data using filter().
stationstoremove <- c("Botanic Garden","Bukit Panjang","Bukit Timah","Choa Chu Kang (Central)","Jurong Pier","Kent Ridge", "Kranji Reservoir", "Lim Chu Kang", "Lower Peirce Reservoir", "Macritchie Reservoir","Mandai", "Marine Parade","Nicoll Highway", "Pasir Ris (Central)", "Punggol", "Queenstown","Simei", "Somerset (Road)","Tanjong Katong", "Toa Payoh", "Tuas", "Ulu Pandan", "Upper Peirce Reservoir","Whampoa")
#create an operator to exclude things
'%!in%' <- function(x,y)!('%in%'(x,y))
#excluded stations that have no temp data at all
temp_clean <- temp %>%
filter(station %!in% stationstoremove)
glimpse(temp_clean)

Rows: 120,139
Columns: 5
$ station <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ tdate <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-04, 2…
$ mean_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maximum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ minimum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
[1] "Admiralty" "East Coast Parkway" "Ang Mo Kio"
[4] "Newton" "Tuas South" "Pasir Panjang"
[7] "Jurong Island" "Choa Chu Kang (South)" "Changi"
[10] "Tai Seng" "Jurong (West)" "Clementi"
[13] "Sentosa Island"
We will then pivot the temp_clean dataframe wider and plot the daily temperature for the remaining weather stations using plotly.
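This reshaping can be sketched the same way as before, this time on temp_clean (the name temp_mean_widec matches its later use in the plotly code):

```r
# one row per date, one column per remaining weather station
temp_mean_widec <- temp_clean %>%
  select(tdate, station, mean_temperature) %>%
  pivot_wider(names_from = station, values_from = mean_temperature)
glimpse(temp_mean_widec)
```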
Rows: 16,071
Columns: 14
$ tdate <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-0…
$ Admiralty <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `East Coast Parkway` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Ang Mo Kio` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Newton <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Tuas South` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Pasir Panjang` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Jurong Island` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Choa Chu Kang (South)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Changi <dbl> 26.6, 26.4, 26.5, 26.3, 27.0, 27.4, 27.1, 27.0…
$ `Tai Seng` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Jurong (West)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Clementi <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Sentosa Island` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
plot_ly(data = temp_mean_widec,
x = ~tdate,
y = ~ Admiralty,
type = "scatter",
mode = "lines+markers") |>
layout(title = "Temperature observed by Weather Station",
xaxis = list(title = "Date"),
yaxis = list(title = "", range = c(0,40)),
theme_ipsum_rc(plot_title_size = 13, plot_title_margin=4, subtitle_size=11, subtitle_margin=4,
axis_title_size = 8, axis_text_size=8, axis_title_face= "bold", plot_margin = margin(4, 4, 4, 4)),
updatemenus = list(list(type = 'dropdown',
xref = "paper",
yref = "paper",
xanchor = "left",
x = 0.04,
y = 0.95,
buttons = list(
list(method = "update",
args = list(list(y = list(temp_mean_widec$Admiralty)),
list(yaxis = list(title = "Temperature in Admiralty", range = c(0,40)))),label = "Admiralty"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`East Coast Parkway`)),
list(yaxis = list(title = "Temperature in East Coast Parkway", range = c(0,40)))),label = "East Coast Parkway"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Ang Mo Kio`)),
list(yaxis = list(title = "Temperature in Ang Mo Kio", range = c(0,40)))),label = "Ang Mo Kio"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$Newton)),
list(yaxis = list(title = "Temperature in Newton", range = c(0,40)))),label = "Newton"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Tuas South`)),
list(yaxis = list(title = "Temperature in Tuas South", range = c(0,40)))),label = "Tuas South"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Pasir Panjang`)),
list(yaxis = list(title = "Temperature in Pasir Panjang", range = c(0,40)))),label = "Pasir Panjang"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Jurong Island`)),
list(yaxis = list(title = "Temperature in Jurong Island", range = c(0,40)))),label = "Jurong Island"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Choa Chu Kang (South)`)),
list(yaxis = list(title = "Temperature in Choa Chu Kang", range = c(0,40)))),label = "Choa Chu Kang"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$Changi)),
list(yaxis = list(title = "Temperature in Changi", range = c(0,40)))),label = "Changi"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Tai Seng`)),
list(yaxis = list(title = "Temperature in Tai Seng", range = c(0,40)))),label = "Tai Seng"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Jurong (West)`)),
list(yaxis = list(title = "Temperature in Jurong West", range = c(0,40)))),label = "Jurong West"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$Clementi)),
list(yaxis = list(title = "Temperature in Clementi", range = c(0,40)))),label = "Clementi"),
list(method = "update",
args = list(list(y = list(temp_mean_widec$`Sentosa Island`)),
list(yaxis = list(title = "Temperature in Sentosa", range = c(0,40)))),label = "Sentosa")
))))

From the above interactive chart, we note that some stations have a longer time period with temperature readings (e.g. Changi). Almost all stations have some missing time gaps/values in the data, hence we will need to impute these missing values to ensure better accuracy of our temperature forecasting.
We also noted that daily temperature readings spanning more than 20 years are too granular for time series forecasting. Hence, in a subsequent section we will aggregate the daily temperature readings into monthly readings by calculating the mean.
In the previous sections, we noted that the dataframes were all tibble dataframes. For us to make use of the time series forecasting packages and their functions, we would need to convert the tibble dataframe into a time series object.
Before we create the time series object, let us first aggregate the daily temperature readings into monthly readings by (1) creating a year_month column for each observation using floor_date(), which truncates each date to the first day of its month, and (2) grouping the readings by station and year_month and using summarise() to compute the monthly averages of mean_temperature, maximum_temperature and minimum_temperature.
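A sketch of the two aggregation steps. Plain mean() is used (without na.rm = TRUE) so that months containing missing days stay NA, consistent with the NAs visible in the monthly output; the name temp_monthly is an assumption:

```r
# (1) truncate each date to the first day of its month
temp_clean <- temp_clean %>%
  mutate(year_month = floor_date(tdate, unit = "month"))

# (2) average the daily readings within each station-month
temp_monthly <- temp_clean %>%
  group_by(station, year_month) %>%
  summarise(mean_temperature    = mean(mean_temperature),
            maximum_temperature = mean(maximum_temperature),
            minimum_temperature = mean(minimum_temperature))
```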
Rows: 120,139
Columns: 6
$ station <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ tdate <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-04, 2…
$ mean_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maximum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ minimum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ year_month <date> 2009-01-01, 2009-01-01, 2009-01-01, 2009-01-01, 2…
Rows: 3,947
Columns: 5
Groups: station [13]
$ station <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ year_month <date> 2009-01-01, 2009-02-01, 2009-03-01, 2009-04-01, 2…
$ mean_temperature <dbl> NA, 26.76786, NA, 28.12000, 28.48387, 28.89667, 28…
$ maximum_temperature <dbl> NA, 31.44286, NA, 32.19667, 32.59032, 32.87000, 31…
$ minimum_temperature <dbl> NA, 24.26071, NA, 25.06667, 25.09355, 25.95000, 25…
With the monthly temperatures of all weather stations, let us filter for one weather station (e.g. Admiralty) to create a tibble dataframe adm, which we can then convert into an xts object, a type of time series object.
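The filtering step might be as simple as the following (assuming the monthly dataframe is called temp_monthly):

```r
# keep only the Admiralty station's monthly readings
adm <- temp_monthly %>%
  filter(station == "Admiralty") %>%
  ungroup()
summary(adm)
```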
station year_month mean_temperature maximum_temperature
Length:179 Min. :2009-01-01 Min. :25.61 Min. :28.86
Class :character 1st Qu.:2012-09-16 1st Qu.:27.10 1st Qu.:31.33
Mode :character Median :2016-07-01 Median :27.74 Median :31.85
Mean :2016-06-19 Mean :27.68 Mean :31.83
3rd Qu.:2020-03-16 3rd Qu.:28.29 3rd Qu.:32.45
Max. :2023-12-01 Max. :29.15 Max. :33.87
NA's :22 NA's :18
minimum_temperature
Min. :23.63
1st Qu.:24.52
Median :24.92
Mean :24.98
3rd Qu.:25.46
Max. :26.45
NA's :18
We will use xts() from the xts package to create a time series object; the order.by parameter uses the dates from the adm dataframe. We then use the ts_regular() function to give the time series object adm_xts a regular interval by adding NA values for missing dates.
In case there are missing months which we did not detect, the na.fill() function fills in those dates by extending values from previous observations.
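The first two steps might be sketched as follows (the na.fill() call is not shown since its exact arguments were not given in the text):

```r
# build an xts object from the temperature columns, ordered by the monthly dates
adm_xts <- xts(adm %>% select(mean_temperature, maximum_temperature,
                              minimum_temperature),
               order.by = adm$year_month)

# regularise the index, inserting NA rows for any missing months
adm_xts <- ts_regular(adm_xts)
```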
Let us plot out the monthly mean temperature of Admiralty weather station using ggplotly.
From the above output, we see that there are missing temperatures for numerous time periods. As a result, the line for the above chart is not continuous.
Let us investigate further using imputeTS package’s ggplot_na_distribution, which highlights the missing values in our data. For the following example, we focus on the mean temperature of the adm time series object.

We also use the imputeTS package’s statsNA to have a report on the number of missing mean temperature readings.
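Both imputeTS diagnostics are one-liners on the mean-temperature column (a sketch, assuming the adm_xts object built earlier):

```r
# visualise where the NAs fall in the series
ggplot_na_distribution(adm_xts$mean_temperature)

# print a text report of NA counts, gaps and gap sizes
statsNA(adm_xts$mean_temperature)
```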
[1] "Length of time series:"
[1] 180
[1] "-------------------------"
[1] "Number of Missing Values:"
[1] 23
[1] "-------------------------"
[1] "Percentage of Missing Values:"
[1] "12.8%"
[1] "-------------------------"
[1] "Number of Gaps:"
[1] 12
[1] "-------------------------"
[1] "Average Gap Size:"
[1] 1.916667
[1] "-------------------------"
[1] "Stats for Bins"
[1] " Bin 1 (45 values from 1 to 45) : 3 NAs (6.67%)"
[1] " Bin 2 (45 values from 46 to 90) : 4 NAs (8.89%)"
[1] " Bin 3 (45 values from 91 to 135) : 11 NAs (24.4%)"
[1] " Bin 4 (45 values from 136 to 180) : 5 NAs (11.1%)"
[1] "-------------------------"
[1] "Longest NA gap (series of consecutive NAs)"
[1] "6 in a row"
[1] "-------------------------"
[1] "Most frequent gap size (series of consecutive NA series)"
[1] "1 NA in a row (occurring 7 times)"
[1] "-------------------------"
[1] "Gap size accounting for most NAs"
[1] "1 NA in a row (occurring 7 times, making up for overall 7 NAs)"
[1] "-------------------------"
[1] "Overview NA series"
[1] " 1 NA in a row: 7 times"
[1] " 2 NA in a row: 2 times"
[1] " 3 NA in a row: 2 times"
[1] " 6 NA in a row: 1 times"
There are several ways to impute missing data in time series objects. We need to impute missing values because some of the models cannot handle NAs in Time Series objects.
As the na_ma() function calculates moving averages based on the last n observations, it generally performs better than mean, mode or median imputation. Moving averages work well when the data has a linear trend. This function also allows us to use linear-weighted and exponentially-weighted moving averages.
adm_imp_movingavg <- na_ma(adm_xts, weighting = "exponential") #default is exponential. Other options are "simple" and "linear". We can allow users to choose the option they want.
plot_ma<- ggplot(adm_imp_movingavg, aes(x = Index, y = mean_temperature)) +
geom_line() + theme_clean() +
labs(title = "Monthly Mean Temperature of Admiralty Weather Station \n(missing values imputed using moving average)") +
xlab("Month-Year") +
ylab("Temperature in degrees celsius") +
theme_ipsum_rc()
ggplotly(plot_ma)

We can also use Kalman Smoothing on an ARIMA model to impute the missing values.

From the above output, we see that some of the imputed values are below 0 degrees Celsius, which is impossible in Singapore. As such, we will not use this method to impute missing temperature readings.
Kalman Smoothing also has a “StructTS” option. Let us try it and see how it works for our temperature data.
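The two Kalman-smoothing variants can be sketched with imputeTS’s na_kalman() (the variable names are assumptions; "StructTS" is the function’s default model):

```r
# Kalman smoothing on a state-space ARIMA model fitted by auto.arima
adm_imp_arima <- na_kalman(adm_xts, model = "auto.arima")

# Kalman smoothing on a basic structural time series model
adm_imp_structts <- na_kalman(adm_xts, model = "StructTS")
```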

From the above output, it seems that the “StructTS” model works better than the auto.arima model, since the imputed results are reasonable. Again, we can let users choose which model they want to use.
Before we build the time series forecasting model, let us test whether our time series data is stationary. Stationarity means that the statistical properties of a time series, such as mean, variance and covariance, remain constant over time, which is the fundamental assumption of many time series modelling techniques. It simplifies the complex dynamics within the data, making it more amenable to analysis, modelling and forecasting.
There are two tests we will use to test for stationarity:
- Augmented Dickey-Fuller (ADF) Test; and
- Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test
Null Hypothesis: the series is non-stationary, i.e. it has a unit root. Alternative Hypothesis: the series is stationary, i.e. it has no unit root.
If we fail to reject the null hypothesis, the test provides evidence that the series is non-stationary.
Augmented Dickey-Fuller Test
data: adm_imp_movingavg$mean_temperature
Dickey-Fuller = -7.3405, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
Since the p-value of 0.01 is less than the 0.05 significance level, we reject the null hypothesis. This means the time series has no unit root, i.e. it is stationary and has no time-dependent structure.
Null Hypothesis: Series is trend stationary or series has no unit root. Alternative Hypothesis: Series is non-stationary, or series has a unit root.
Note: The hypothesis is reversed in the KPSS test compared to ADF Test.
KPSS Test for Level Stationarity
data: adm_imp_movingavg$mean_temperature
KPSS Level = 0.086074, Truncation lag parameter = 4, p-value = 0.1
Since the p-value of 0.1 is greater than the 0.05 significance level, we fail to reject the null hypothesis of the KPSS test. This means we can assume that the time series is trend stationary.
Both ADF and KPSS tests conclude that the given series is stationary. This means that we can make use of most of the time series forecasting models such as Exponential Smoothing and ARIMA.
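The two tests can be reproduced on simulated data. A minimal sketch (assuming the tseries package; white noise is stationary by construction):

```r
library(tseries)

set.seed(42)
x <- rnorm(200)  # white noise: stationary by construction

adf.test(x)   # small p-value: reject H0 of non-stationarity
kpss.test(x)  # large p-value: fail to reject H0 of level stationarity
```

In the Shiny application, both tests could be run on the imputed series and their p-values displayed side by side.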
Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components to improve our understanding of the series and our forecast accuracy.
First, let us plot the monthly mean temperature of the Admiralty weather station.
From the above output, it seems like there were fluctuations in monthly mean temperature but there was no increasing trend.
It also seems that, for each year, the mean temperature is usually highest around May/June, as indicated by the peaks, while the lowest mean temperature usually occurs around December/January.
To find out if there is seasonality, trend and cycle, we can decompose a time series object using stl() from the stats package. STL is a versatile and robust method for decomposing time series. STL is an acronym for "Seasonal and Trend decomposition using Loess", where Loess is a method for estimating nonlinear relationships. The STL method was developed by R. B. Cleveland, Cleveland, McRae, & Terpenning (1990).
In the following code chunk, we use:
- stl() to decompose the time series object; and
- ts_ts() from the tsbox package to convert the xts object into a ts object that can be used with stl().
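As a self-contained illustration on a built-in dataset (co2, monthly atmospheric CO2 readings, already a ts object, so no conversion is needed):

```r
# STL decomposition into seasonal, trend and remainder components
fit <- stl(co2, s.window = "periodic")

colnames(fit$time.series)  # "seasonal" "trend" "remainder"
summary(fit)
```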

Call:
stl(x = ts_ts(adm_imp_movingavg$mean_temperature), s.window = "periodic")
Time.series components:
seasonal trend remainder
Min. :-0.9176109 Min. :27.30177 Min. :-0.9341622
1st Qu.:-0.5109855 1st Qu.:27.43056 1st Qu.:-0.2329592
Median : 0.1857624 Median :27.65282 Median :-0.0124042
Mean : 0.0000000 Mean :27.66756 Mean :-0.0019506
3rd Qu.: 0.4489919 3rd Qu.:27.85388 3rd Qu.: 0.2320894
Max. : 0.7298073 Max. :28.19072 Max. : 0.9402058
IQR:
STL.seasonal STL.trend STL.remainder data
0.9600 0.4233 0.4650 1.0218
% 94.0 41.4 45.5 100.0
Weights: all == 1
Other components: List of 5
$ win : Named num [1:3] 1801 19 13
$ deg : Named int [1:3] 0 1 1
$ jump : Named num [1:3] 181 2 2
$ inner: int 2
$ outer: int 0
From the above output, we note that there is no clear linear trend in the monthly mean temperature over the years. However, we do note that the trend component ranges from about 27.3 to 28.2 degrees Celsius.
We observed seasonality in the monthly mean temperature over the years.
First we will split the data into training and validation data.
When choosing models, it is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.
The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required.
For this section, we will set the test set to ~20% of the dataframe we have. However, when building the Shiny dashboard, we should allow user input on the duration they want to forecast (e.g. the next few months or years) because this would affect the size of the test dataset.
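An ~80/20 split on a monthly ts object can be sketched with base R's window(), illustrated here on the built-in AirPassengers data (our own training and test objects can be created similarly):

```r
n      <- length(AirPassengers)                # 144 monthly observations
cutoff <- time(AirPassengers)[floor(0.8 * n)]  # time index at the ~80% mark

training <- window(AirPassengers, end = cutoff)
test     <- window(AirPassengers, start = cutoff + 1/12)

length(training)  # 115
length(test)      # 29
```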
Some forecasting methods are extremely simple and surprisingly effective. We will use the following two forecasting methods (i.e. naive and seasonal naive) as benchmarks.
For naïve forecasts, we simply set all forecasts to be the value of the last observation.
Mean absolute percentage error (MAPE) is the percentage equivalent of mean absolute error (MAE). Mean absolute percentage error measures the average magnitude of error produced by a model, or how far off predictions are on average.
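Both ideas can be sketched on toy numbers; MLmetrics::MAPE(y_pred, y_true) computes the same quantity as the hand-rolled formula below:

```r
actual <- c(100, 200, 300)  # hold-out (test) values
pred   <- rep(90, 3)        # naive forecast: repeat the last training value, here 90

# MAPE: mean of |actual - predicted| / |actual|
mape <- mean(abs((actual - pred) / actual))
mape  # 0.45, i.e. an average error of 45%
```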
To measure the performance of how well the model’s forecasted values as compared to the test dataset, we use MAPE() of MLmetrics package to calculate the MAPE.
From the above output, we have a MAPE of 2.79%, meaning that the average difference between the forecasted and actual values is 2.79%. For such a simple model, this forecasting accuracy is very good! It means that other models would need an even lower MAPE for us to consider them.
A similar method is useful for highly seasonal data. In this case, we set each forecast to be equal to the last observed value from the same season (e.g., the same month of the previous year).
From the above output, we have a MAPE of 2.07%, meaning that the average difference between the forecasted and actual values is 2.07%. For a simple model, this is even better than the naive model! It means that other models would need an even lower MAPE for us to consider them.
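The seasonal naive idea can be sketched in base R (forecast::snaive() implements it for ts objects, with prediction intervals):

```r
y <- ts(1:24, frequency = 12)  # two years of monthly data: 1..24

# Forecast the next 12 months by repeating the last observed cycle
snaive_fc <- as.numeric(tail(y, 12))
snaive_fc  # 13 14 15 ... 24
```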
The simplest of the exponentially smoothing methods is naturally called simple exponential smoothing (SES). This method is suitable for forecasting data with no clear trend or seasonal pattern.
From the above output, we have a MAPE of 2.78%, meaning that the average difference between the forecasted and actual values is 2.78%, which is slightly better than the naive model but poorer than the seasonal naive model.
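SES can be sketched with base R's HoltWinters() by switching off the trend (beta) and seasonal (gamma) components; forecast::ses() is the equivalent in the forecast package:

```r
# Simple exponential smoothing: level only, no trend, no seasonality
fit <- HoltWinters(co2, beta = FALSE, gamma = FALSE)

fit$alpha                  # the estimated smoothing parameter
predict(fit, n.ahead = 6)  # flat forecasts at the last smoothed level
```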
State space models provide a flexible framework for modeling time series data. They consist of two components: the state equation and the observation equation. The state equation describes how the underlying states of the system evolve over time, while the observation equation relates the observed data to the underlying states.
State space models allow us to capture complex dynamics and dependencies in the data, making them suitable for a wide range of applications, including finance, economics and engineering.
We use ets() from forecast package to find out the optimal model.
ETS(M,N,A)
Call:
ets(y = trainingtemp)
Smoothing parameters:
alpha = 0.2716
gamma = 1e-04
Initial states:
l = 27.8746
s = -0.8262 -0.4477 0.1394 0.235 0.4386 0.5042
0.6772 0.6073 0.3135 -0.0647 -0.631 -0.9455
sigma: 0.0158
AIC AICc BIC
544.5750 548.0035 590.3228
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.005080769 0.4168655 0.336926 -0.03791676 1.219642 0.7034614
ACF1
Training set 0.1037638
We see ETS(M,N,A). This means we have an ETS model with multiplicative errors, no trend and additive seasonality. Additive seasonality means the widths and heights of the seasonal periods do not change over time.
From the above output, we have a MAPE of 1.47%, meaning that the average difference between the forecasted and actual values is 1.47%; this is so far the best performing model.
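A sketch of the call (assuming the forecast package is loaded); ets() selects the error/trend/seasonal form automatically:

```r
library(forecast)

fit <- ets(AirPassengers)
fit$method             # e.g. "ETS(M,Ad,M)" -- the automatically selected form
fc  <- forecast(fit, h = 12)  # forecast 12 periods ahead
```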
Since time series analysis decomposes past weather observations into seasonal, trend, and random components, that data can be used to create forecasts based on an assumption that the observed patterns will continue into the future. This type of forecasting is especially useful in business for anticipating potential demand for seasonal products.
To forecast future temperatures based on historical observations, we can use the Holt-Winters model, which considers past seasonal cycles, trends, and random variation.
Note that for Holt-Winters’ method, we can choose additive or multiplicative seasonality to forecast the monthly mean temperature, which can be part of the user input.
From the above output, the Holt-Winters' additive seasonality model has a MAPE of 1.47%, meaning that the average difference between the forecasted and actual values is 1.47%, matching the State Space Model's performance.
The multiplicative seasonality model has a MAPE of 1.53%, which is slightly poorer than the additive seasonality model.
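Both seasonality types can be sketched with base R's HoltWinters(); forecast::hw() exposes the same choice via its seasonal argument:

```r
fit_add  <- HoltWinters(AirPassengers, seasonal = "additive")
fit_mult <- HoltWinters(AirPassengers, seasonal = "multiplicative")

# Compare in-sample fit (sum of squared errors)
c(additive = fit_add$SSE, multiplicative = fit_mult$SSE)
```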
ARIMA models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, auto regressive integrated moving average (ARIMA) modeling involves a more detailed analysis of the training data using lags and lagged forecast errors.
The first step is to use a function like auto.arima() to analyze the data and find appropriate model configuration parameters.
Series: trainingtemp
ARIMA(1,0,1)(2,1,1)[12] with drift
Coefficients:
ar1 ma1 sar1 sar2 sma1 drift
0.7912 -0.4842 0.0130 -0.1503 -0.8839 0.0013
s.e. 0.1180 0.1735 0.1349 0.1298 0.1996 0.0018
sigma^2 = 0.1948: log likelihood = -93.88
AIC=201.76 AICc=202.59 BIC=222.55
The function returned the following model: ARIMA (1,0,1)(2,1,1)[12] with drift.
From the above output, we have a MAPE of 1.55%, meaning that the average difference between the forecasted and actual values is 1.55%, which is better than the benchmark models.
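A sketch of the call (assuming the forecast package is loaded):

```r
library(forecast)

fit <- auto.arima(AirPassengers)
fit                     # prints the selected ARIMA(p,d,q)(P,D,Q)[12] model
fc <- forecast(fit, h = 24)  # forecast 24 periods ahead
```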
| Model | MAPE (%) |
|---|---|
| Naive | 2.79 |
| Seasonal Naive | 2.07 |
| Simple Exponential Smoothing | 2.78 |
| State Space Model | 1.47 |
| Holt-Winters’ Model (Additive Seasonality) | 1.47 |
| Holt-Winters’ Model (Multiplicative Seasonality) | 1.53 |
| ARIMA | 1.55 |
From the above table, the State Space Model, Holt-Winters' models and ARIMA model all outperformed the benchmark models (i.e. the Naive and Seasonal Naive models) for temperature data. We can consider letting users choose among these models when forecasting temperature data.
In this section, we will be cleaning the rainfall data and building models for the rainfall data.
First we select the relevant rainfall related columns needed for this exercise using select(). We will examine the resultant dataframe rainfall using str().
rainfall is a tibble dataframe with the following columns:
- tdate: the date the rainfall reading was collected.
- station: the weather station that collected the rainfall reading.
- daily_rainfall_total: the total amount of rainfall observed by the weather station in a day, in mm.
Let us save the dataframe into an RDS file for future use using write_rds().
We will bring in the rainfall data using read_rds(). Let us check the imported RDS data using str() again.
tibble [329,156 × 3] (S3: tbl_df/tbl/data.frame)
$ tdate : Date[1:329156], format: "1980-01-01" "1980-01-02" ...
$ station : chr [1:329156] "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" ...
$ daily_rainfall_total: num [1:329156] 0 0 0 0 22.6 49.6 2.4 0 0 0 ...
rainfall is a tibble dataframe. station is character data, tdate is date data and daily_rainfall_total is numeric data.
First, let us use summary() to check for missing data.
tdate station daily_rainfall_total
Min. :1980-01-01 Length:329156 Min. : 0.000
1st Qu.:1997-04-29 Class :character 1st Qu.: 0.000
Median :2011-09-18 Mode :character Median : 0.200
Mean :2007-07-02 Mean : 6.822
3rd Qu.:2017-11-27 3rd Qu.: 6.500
Max. :2023-12-31 Max. :278.600
NA's :58 NA's :5136
The observations range from 1 Jan 1980 to 31 Dec 2023. There are 58 rows with missing dates. We should drop these rows since they cannot tell us which day the observations were made (even if they have rainfall readings).
There are 5,136 rows of NAs for daily rainfall.
First, let us drop the rows where the date is missing, because we would not be able to definitively identify when the rainfall readings were collected (even if those rows have rainfall readings).
tdate station daily_rainfall_total
Min. :1980-01-01 Length:329098 Min. : 0.000
1st Qu.:1997-04-29 Class :character 1st Qu.: 0.000
Median :2011-09-18 Mode :character Median : 0.200
Mean :2007-07-02 Mean : 6.822
3rd Qu.:2017-11-27 3rd Qu.: 6.500
Max. :2023-12-31 Max. :278.600
NA's :5078
Let us also do a check of the weather stations.
[1] "Macritchie Reservoir" "Lower Peirce Reservoir"
[3] "Admiralty" "East Coast Parkway"
[5] "Ang Mo Kio" "Newton"
[7] "Lim Chu Kang" "Marine Parade"
[9] "Choa Chu Kang (Central)" "Tuas South"
[11] "Pasir Panjang" "Jurong Island"
[13] "Nicoll Highway" "Botanic Garden"
[15] "Choa Chu Kang (South)" "Whampoa"
[17] "Changi" "Jurong Pier"
[19] "Ulu Pandan" "Mandai"
[21] "Tai Seng" "Jurong (West)"
[23] "Clementi" "Sentosa Island"
[25] "Bukit Panjang" "Kranji Reservoir"
[27] "Upper Peirce Reservoir" "Kent Ridge"
[29] "Queenstown" "Tanjong Katong"
[31] "Somerset (Road)" "Punggol"
[33] "Simei" "Toa Payoh"
[35] "Tuas" "Bukit Timah"
[37] "Pasir Ris (Central)"
From the previous section, we noted that there are many weather stations in the rainfall dataframe. Hence, we will make use of plotly to further explore the missing rainfall values.
First, we will pivot the dataframe wider.
tdate Macritchie Reservoir Lower Peirce Reservoir
Min. :1980-01-01 Min. : 0.000 Min. : 0.000
1st Qu.:1990-12-31 1st Qu.: 0.000 1st Qu.: 0.000
Median :2001-12-31 Median : 0.200 Median : 0.400
Mean :2001-12-31 Mean : 7.145 Mean : 7.706
3rd Qu.:2012-12-30 3rd Qu.: 7.100 3rd Qu.: 8.200
Max. :2023-12-31 Max. :256.000 Max. :227.600
NA's :121 NA's :11141
Admiralty East Coast Parkway Ang Mo Kio Newton
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.400 Median : 0.000 Median : 0.400 Median : 0.200
Mean : 6.735 Mean : 4.977 Mean : 7.218 Mean : 6.625
3rd Qu.: 6.800 3rd Qu.: 3.800 3rd Qu.: 7.800 3rd Qu.: 6.800
Max. :142.000 Max. :192.600 Max. :164.400 Max. :150.400
NA's :10785 NA's :10778 NA's :10901 NA's :11112
Lim Chu Kang Marine Parade Choa Chu Kang (Central) Tuas South
Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.400 Median : 0.00 Median : 0.40 Median : 0.200
Mean : 6.759 Mean : 5.61 Mean : 7.51 Mean : 6.892
3rd Qu.: 7.000 3rd Qu.: 4.95 3rd Qu.: 8.60 3rd Qu.: 6.200
Max. :158.600 Max. :195.40 Max. :144.20 Max. :208.200
NA's :11214 NA's :10933 NA's :10994 NA's :11472
Pasir Panjang Jurong Island Nicoll Highway Botanic Garden
Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.20 Median : 0.000 Median : 0.200 Median : 0.200
Mean : 6.29 Mean : 6.287 Mean : 6.548 Mean : 7.283
3rd Qu.: 6.00 3rd Qu.: 5.600 3rd Qu.: 6.200 3rd Qu.: 7.800
Max. :151.60 Max. :185.200 Max. :163.800 Max. :160.400
NA's :11031 NA's :11582 NA's :11269 NA's :11228
Choa Chu Kang (South) Whampoa Changi Jurong Pier
Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.400 Median : 0.200 Median : 0.00 Median : 0.200
Mean : 7.469 Mean : 6.888 Mean : 5.81 Mean : 7.131
3rd Qu.: 8.200 3rd Qu.: 7.100 3rd Qu.: 4.40 3rd Qu.: 7.000
Max. :143.800 Max. :154.600 Max. :216.20 Max. :226.400
NA's :11491 NA's :11606 NA's :275
Ulu Pandan Mandai Tai Seng Jurong (West)
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.200 Median : 0.400 Median : 0.000 Median : 0.300
Mean : 7.201 Mean : 7.218 Mean : 6.757 Mean : 7.224
3rd Qu.: 6.900 3rd Qu.: 7.150 3rd Qu.: 5.900 3rd Qu.: 7.000
Max. :230.400 Max. :247.200 Max. :217.200 Max. :226.200
NA's :355 NA's :828 NA's :5 NA's :445
Clementi Sentosa Island Bukit Panjang Kranji Reservoir
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.300 Median : 0.000 Median : 0.400 Median : 0.300
Mean : 7.216 Mean : 6.166 Mean : 7.316 Mean : 7.041
3rd Qu.: 7.200 3rd Qu.: 5.100 3rd Qu.: 7.400 3rd Qu.: 7.100
Max. :239.500 Max. :220.800 Max. :235.600 Max. :239.800
NA's :207 NA's :365 NA's :227 NA's :234
Upper Peirce Reservoir Kent Ridge Queenstown Tanjong Katong
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.400 Median : 0.200 Median : 0.200 Median : 0.100
Mean : 7.527 Mean : 7.365 Mean : 6.805 Mean : 6.136
3rd Qu.: 7.900 3rd Qu.: 7.400 3rd Qu.: 6.100 3rd Qu.: 5.100
Max. :202.800 Max. :179.600 Max. :278.600 Max. :226.000
NA's :11168 NA's :10858 NA's :773 NA's :392
Somerset (Road) Punggol Simei Toa Payoh
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.200 Median : 0.200 Median : 0.000 Median : 0.200
Mean : 6.387 Mean : 6.594 Mean : 5.987 Mean : 7.211
3rd Qu.: 6.200 3rd Qu.: 6.400 3rd Qu.: 5.200 3rd Qu.: 7.600
Max. :155.800 Max. :168.800 Max. :182.600 Max. :150.200
NA's :11427 NA's :10767 NA's :11013 NA's :11037
Tuas Bukit Timah Pasir Ris (Central)
Min. : 0.000 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.200 Median : 0.40 Median : 0.200
Mean : 7.198 Mean : 7.13 Mean : 6.165
3rd Qu.: 7.600 3rd Qu.: 7.40 3rd Qu.: 5.400
Max. :217.000 Max. :156.80 Max. :185.800
NA's :10690 NA's :10826 NA's :11057
We will make use of plotly to explore the daily rainfall for each station using a dropdown list.
station_names <- setdiff(colnames(rain_wide), "tdate")

# Build one dropdown button per weather station programmatically
# instead of hand-coding 37 near-identical list() entries.
station_buttons <- lapply(station_names, function(stn) {
  list(method = "update",
       args = list(list(y = list(rain_wide[[stn]])),
                   list(yaxis = list(title = paste("Total Rainfall observed at", stn)))),
       label = stn)
})

plot_ly(data = rain_wide,
        x = ~tdate,
        y = ~Admiralty,
        type = "scatter",
        mode = "lines") |>
  layout(title = "Total Rainfall observed by Weather Station",
         xaxis = list(title = "Date",
                      range = c(as.Date("1980-01-01"), as.Date("2023-12-31"))), # range must be given as `range = c(...)`
         yaxis = list(title = "Total Rainfall observed in Admiralty"),
         updatemenus = list(list(type = 'dropdown',
                                 xref = "paper",
                                 yref = "paper",
                                 xanchor = "left",
                                 x = 0.04,
                                 y = 0.95,
                                 buttons = station_buttons)))
Let us find out the amount of missing values for each weather station using the following code chunk.
missing_values <- rain_wide %>%
gather(key = "key", value = "val") %>%
mutate(isna = is.na(val)) %>%
group_by(key) %>%
mutate(total = n()) %>%
group_by(key, total, isna) %>%
summarise(num.isna = n()) %>%
mutate(pct = num.isna / total * 100)
levels <-
(missing_values %>% filter(isna == T) %>% arrange(desc(pct)))$key
percentage_plot <- missing_values %>%
ggplot() +
geom_bar(aes(x = reorder(key, desc(pct)),
y = pct, fill=isna),
stat = 'identity', alpha=0.8) +
scale_x_discrete(limits = levels) +
scale_fill_manual(name = "",
values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
coord_flip() +
labs(title = "Percentage of missing values", x =
'Variable', y = "% of missing values")
Let us check if there are any stations with 100% missing values.
character(0)
From the above output, it seems like no stations have 100% missing values.
Let us check if there are any stations with no missing values.
[1] "Changi" "tdate"
From the above output, it seems that only the Changi weather station has no missing values; every other weather station has some missing data. Let us impute the missing values in the next section.
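These checks can be sketched in base R by counting NAs per column of the wide data frame (the column names below are a toy example, not our actual stations):

```r
df <- data.frame(Changi = c(4.4, 0, 2.1), Mandai = c(0.4, NA, 7.2))

na_counts <- colSums(is.na(df))
names(na_counts)[na_counts == 0]         # columns with complete records
names(na_counts)[na_counts == nrow(df)]  # columns with 100% missing values
```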
As mentioned earlier, for us to make use of the time series forecasting packages and their functions, we would need to convert the tibble dataframe into a time series object.
Before we create the time series object, let us first aggregate the daily rainfall readings to monthly rainfall readings by (1) creating a year-month column for each observation using floor_date(), specifying it to derive the year and month of each observation, and (2) grouping the rainfall readings by station and year_month, then using summarise() to compute the monthly total rainfall.
Rows: 329,098
Columns: 4
$ tdate <date> 1980-01-01, 1980-01-02, 1980-01-03, 1980-01-04, …
$ station <chr> "Macritchie Reservoir", "Macritchie Reservoir", "…
$ daily_rainfall_total <dbl> 0.0, 0.0, 0.0, 0.0, 22.6, 49.6, 2.4, 0.0, 0.0, 0.…
$ year_month <date> 1980-01-01, 1980-01-01, 1980-01-01, 1980-01-01, …
Rows: 10,813
Columns: 3
Groups: station [37]
$ station <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty", "Admira…
$ year_month <date> 2009-01-01, 2009-02-01, 2009-03-01, 2009-04-01, 2009-05-01…
$ total_rf <dbl> NA, 148.0, NA, 148.8, 205.6, 92.0, 103.0, 90.2, 67.6, 160.0…
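The two aggregation steps above can be sketched on a toy data frame (assuming dplyr and lubridate are loaded):

```r
library(dplyr)
library(lubridate)

toy <- tibble(
  tdate   = as.Date(c("1980-01-01", "1980-01-15", "1980-02-03")),
  station = "Admiralty",
  daily_rainfall_total = c(0, 22.6, 5.0)
)

monthly <- toy |>
  mutate(year_month = floor_date(tdate, unit = "month")) |>  # 1980-01-01, 1980-01-01, 1980-02-01
  group_by(station, year_month) |>
  summarise(total_rf = sum(daily_rainfall_total), .groups = "drop")

monthly$total_rf  # 22.6 (Jan 1980), 5.0 (Feb 1980)
```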
With the monthly rainfall of all weather stations, let us filter out one weather station (e.g. Admiralty) to create a tibble dataframe adm_rf so that we can convert it into an xts object, which is a type of time series object.
station year_month total_rf
Length:179 Min. :2009-01-01 Min. : 15.8
Class :character 1st Qu.:2012-09-16 1st Qu.:124.4
Mode :character Median :2016-07-01 Median :189.0
Mean :2016-06-19 Mean :204.9
3rd Qu.:2020-03-16 3rd Qu.:270.4
Max. :2023-12-01 Max. :517.4
NA's :18
We will use xts() from the xts package to create a time series object. The order.by parameter uses the dates from the adm_rf dataframe. We then use the ts_regular() function to give the time series object adm_rf_xts a regular interval by adding NA values for missing dates.
Just in case there are missing months which we did not detect, we use the na.fill() function to fill in those gaps by extending values from neighbouring days.
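A sketch of the idea on toy dates and values (assuming the xts package; na.fill() comes from zoo, which xts builds on):

```r
library(xts)

dates  <- as.Date(c("2009-01-01", "2009-02-01", "2009-03-01", "2009-04-01"))
values <- c(148, NA, 206, 92)

x <- xts(values, order.by = dates)
x <- na.fill(x, "extend")  # interior NAs are linearly interpolated,
                           # leading/trailing NAs are extended

as.numeric(x)  # 148 177 206 92
```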
Let us plot out the monthly rainfall of Admiralty weather station using ggplotly.
p3 <- ggplot(adm_rf_xts, aes(x = Index, y = value)) +
geom_line() + theme_clean() +
labs(title = "Monthly Rainfall of Admiralty Weather Station", caption = "Data from Weather.gov.sg") +
xlab("Month-Year") +
ylab("Rainfall (in mm)") +
theme_ipsum_rc()
ggplotly(p3)
From the above output, we see that rainfall values are missing for numerous time periods. As a result, the line in the above chart is not continuous.
Let us investigate further using the imputeTS package's ggplot_na_distribution(), which highlights the missing values in our data.
There are several ways to impute missing data in time series objects. As with the temperature data, we first use na_ma(), which allows us to use simple, linear-weighted and exponentially-weighted moving averages.
We can also use Kalman Smoothing on ARIMA model to impute the missing values.
admrf_imp_kalman <- na_kalman(adm_rf_xts, model = "auto.arima")
#plot chart
ggplot(admrf_imp_kalman, aes(x = Index, y = value)) +
geom_line()
Kalman Smoothing also has a "StructTS" option. Let us see how it works for our monthly rainfall data.
admrf_imp_kalmans <- na_kalman(adm_rf_xts, model = "StructTS")
#plot chart
ggplot(admrf_imp_kalmans, aes(x = Index, y = value)) +
geom_line()
For the subsequent sections, we will use the results from moving average to build our model.
Similar to the temperature data, we will also test for stationarity for our rainfall data.
Augmented Dickey-Fuller Test
data: admrf_imp_movingavg$value
Dickey-Fuller = -4.5358, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
Since the p-value of 0.01 is less than the 0.05 significance level, we reject the null hypothesis. This means the time series has no unit root, i.e. it is stationary and has no time-dependent structure.
KPSS Test for Level Stationarity
data: admrf_imp_movingavg$value
KPSS Level = 0.20185, Truncation lag parameter = 4, p-value = 0.1
Since the p-value of 0.1 is greater than the 0.05 significance level, we fail to reject the null hypothesis of the KPSS test. This means we can assume that the time series is trend stationary.
Both ADF and KPSS tests conclude that the given series is stationary. This means that we can make use of most of the time series forecasting models such as Exponential Smoothing and ARIMA.
First, let us plot the monthly rainfall of the Admiralty weather station.
p4 <- ggplot(admrf_imp_movingavg, aes(x = Index, y = value)) +
geom_line() +
geom_smooth(method=lm)
ggplotly(p4)
From the above output, it seems there were fluctuations in monthly rainfall but no increasing trend. We also cannot easily tell whether there is any seasonality.
Let us decompose the time series object and explore further.
fit <- stl(ts_ts(admrf_imp_movingavg), s.window= "periodic")
## plot out the decomposition results
autoplot(fit)+
ggtitle("Decomposition for Monthly Rainfall") +
xlab("Month-Year") +
theme_clean()
Call:
stl(x = ts_ts(admrf_imp_movingavg), s.window = "periodic")
Time.series components:
seasonal trend remainder
Min. :-57.77066 Min. :136.8089 Min. :-162.14714
1st Qu.:-26.90668 1st Qu.:172.5815 1st Qu.: -58.59127
Median : -4.90670 Median :199.8634 Median : -0.22377
Mean : 0.00000 Mean :203.0772 Mean : -0.11451
3rd Qu.: 24.09846 3rd Qu.:236.0493 3rd Qu.: 51.80263
Max. : 77.82946 Max. :290.0645 Max. : 271.94494
IQR:
STL.seasonal STL.trend STL.remainder data
51.01 63.47 110.39 133.60
% 38.2 47.5 82.6 100.0
Weights: all == 1
Other components: List of 5
$ win : Named num [1:3] 1801 19 13
$ deg : Named int [1:3] 0 1 1
$ jump : Named num [1:3] 181 2 2
$ inner: int 2
$ outer: int 0
From the above output, we noted that there is no clear linear trend in the monthly rainfall over the years. However, we do note that rainfall in the earlier time periods was around 150mm, while the later time periods were around 200mm.
We also observed seasonality in the monthly rainfall over the years.
First, we will split the data into training and validation sets. As with the temperature data, we will let users decide how much training and test data they want, and the duration of the forecast.
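The split itself can be done with `window()`. Below is a minimal sketch assuming a 36-month series and a fixed cut-off; in the Shiny app, the cut-off would come from user input. The object names `trainingrf` and `validationrf` mirror those used later in this exercise.

```r
# Synthetic monthly rainfall series standing in for the Admiralty data
set.seed(123)
rainfall <- ts(runif(36, min = 50, max = 350),
               frequency = 12, start = c(2021, 1))

# In the Shiny app, the cut-off would come from a user input
trainingrf   <- window(rainfall, end   = c(2023, 6))
validationrf <- window(rainfall, start = c(2023, 7))

length(trainingrf)    # 30 months for training
length(validationrf)  # 6 months held out for validation
```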
From the above output, we have a MAPE of ~40%, meaning that the average difference between the forecasted and actual values is ~40%. Compared to the temperature data, this MAPE is very high and there is a lot of room for improvement.
The MAPE of the seasonal naive model, at ~70%, is worse than that of the naive model. Let’s try other models and see if there are any improvements.
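The two benchmark models are one-liners in the forecast package. A hedged sketch on synthetic data (the actual rainfall series would produce different MAPE values):

```r
library(forecast)

# Synthetic monthly rainfall standing in for the Admiralty data
set.seed(123)
rainfall <- ts(runif(48, min = 50, max = 350),
               frequency = 12, start = c(2020, 1))
trainingrf   <- window(rainfall, end   = c(2023, 6))
validationrf <- window(rainfall, start = c(2023, 7))

naive_model  <- naive(trainingrf,  h = length(validationrf))
snaive_model <- snaive(trainingrf, h = length(validationrf))

# accuracy() reports MAPE alongside ME, RMSE, MAE, etc.
accuracy(naive_model,  validationrf)["Test set", "MAPE"]
accuracy(snaive_model, validationrf)["Test set", "MAPE"]
```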
autoplot(snaive_model) +
ggtitle("Seasonal Naive Forecasts for Monthly Total Rainfall") +
xlab("Month-Year") +
ylab("Amount of Rainfall (in mm)") +
theme_ipsum_rc()
From the above output, we have a MAPE of ~43%, meaning that the average difference between the forecasted and actual values is ~43%. This MAPE is similar to that of the naive model.
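Simple exponential smoothing can be fitted with `ses()` from the forecast package; a sketch under the same synthetic-data assumption used for illustration throughout (actual values will differ):

```r
library(forecast)

# Synthetic monthly rainfall standing in for the Admiralty data
set.seed(123)
rainfall <- ts(runif(48, min = 50, max = 350),
               frequency = 12, start = c(2020, 1))
trainingrf   <- window(rainfall, end   = c(2023, 6))
validationrf <- window(rainfall, start = c(2023, 7))

# Simple exponential smoothing: level only, no trend or seasonality
ses_model <- ses(trainingrf, h = length(validationrf))
accuracy(ses_model, validationrf)["Test set", "MAPE"]
```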
ETS(M,N,A)
Call:
ets(y = trainingrf)
Smoothing parameters:
alpha = 0.2289
gamma = 3e-04
Initial states:
l = 211.8336
s = 28.6946 69.9082 11.4358 -25.3137 -47.8125 -12.06
-50.4517 33.2959 61.8826 6.3161 -69.7022 -6.1932
sigma: 0.4492
AIC AICc BIC
2185.839 2189.267 2231.587
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.02588395 82.13912 65.51488 -21.92718 45.45668 0.6772009
ACF1
Training set 0.03461945
We see ETS(M,N,A). This means we have an ETS model with multiplicative errors, no trend and additive seasonality. Additive seasonality means the widths and heights of the seasonal periods do not change over time.
From the above output, we have a MAPE of 45.5%, which is worse than the naive model, our benchmark model.
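The Holt-Winters variants compared in the table below can be fitted with `hw()` from the forecast package. A sketch on synthetic data (actual MAPE values will differ from those reported here):

```r
library(forecast)

# Synthetic monthly rainfall standing in for the Admiralty data
set.seed(123)
rainfall <- ts(runif(48, min = 50, max = 350),
               frequency = 12, start = c(2020, 1))
trainingrf   <- window(rainfall, end   = c(2023, 6))
validationrf <- window(rainfall, start = c(2023, 7))

hw_add  <- hw(trainingrf, seasonal = "additive",       h = length(validationrf))
hw_mult <- hw(trainingrf, seasonal = "multiplicative", h = length(validationrf))

accuracy(hw_add,  validationrf)["Test set", "MAPE"]
accuracy(hw_mult, validationrf)["Test set", "MAPE"]
```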
| Model | MAPE (%) |
|---|---|
| Naive | 39.96% |
| Seasonal Naive | 69.78% |
| Simple Exponential Smoothing | 42.79% |
| Holt-Winters’ Additive Seasonality | 48.54% |
| Holt-Winters’ Multiplicative Seasonality | 63.31% |
| ARIMA | 40.20% |
From the above table, we see that except for the ARIMA model, which has a MAPE of 40%, none of the models outperformed the naive model when forecasting rainfall. As such, we can consider exposing only the Naive and ARIMA models for rainfall prediction in our Shiny application.
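The ARIMA result can be obtained with `auto.arima()`, which selects the order automatically. A sketch under the same synthetic-data assumption (the real series would yield the ~40% MAPE reported above):

```r
library(forecast)

# Synthetic monthly rainfall standing in for the Admiralty data
set.seed(123)
rainfall <- ts(runif(48, min = 50, max = 350),
               frequency = 12, start = c(2020, 1))
trainingrf   <- window(rainfall, end   = c(2023, 6))
validationrf <- window(rainfall, start = c(2023, 7))

# auto.arima() searches over (p,d,q)(P,D,Q) orders by AICc
arima_model <- auto.arima(trainingrf)
arima_fc    <- forecast(arima_model, h = length(validationrf))
accuracy(arima_fc, validationrf)["Test set", "MAPE"]
```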
After working through the previous sections, we can see that not every model works for the temperature and rainfall data. We also have to impute missing data before we can build such time series forecasting models. As such, the following sections summarise the parameters and outputs to expose on the Shiny application and how we can go about exposing them.
Type of data to forecast:
Length of forecast:
Which weather station
Method of imputing missing data
Forecasting model to use
For temperature data, we can consider allowing users to choose from the Holt-Winters, State Space and ARIMA models, since these performed well on our data.
For rainfall data, if we choose to allow users to do forecasting on our Shiny application, we can consider allowing them to choose between the Naive and ARIMA models.
Data table and plot of the output for missing data imputation chosen
Data table and plot of the output for the forecasting model chosen
MAPE of the forecasting model chosen
| Parameter / Output | How to expose it |
|---|---|
| Type of data to forecast | Each data (i.e. temperature or rainfall) is in a different page. Hence, user can choose which data to forecast by clicking on the respective page on the Navbar Pages. |
| Weather Station | A dropdown list to allow user to select which weather station they want to use for forecasting |
| Temperature / rainfall data of chosen weather station | After selecting the type of data to forecast and the weather station, we can display an interactive plot of the past data and show if there are any missing data. This can help the user choose which imputation method to use for the next step. |
| Method of Imputing Missing Data | A dropdown list to allow the user to select which missing data imputation method they want to use, such as the methods tested earlier (e.g. moving average and Kalman Smoothing (ARIMA)). |
| Data table and plot of the output for the missing data | After selecting the type of missing data imputation method, we can display an interactive plot and data table of the resultant dataframe. This can help them decide which imputation method is most suitable. For example in the earlier section, we realised that Kalman Smoothing (ARIMA) was not suitable for Admiralty weather station’s monthly temperature data after plotting the chart using the resultant dataframe. |
| Decomposition of Time Series Data | To help the users choose the forecasting model, we can also display the output of the time series data’s decomposition |
| Forecasting Model to use | A dropdown list to allow the user to select which forecasting method they want to use. For temperature data, the options are Holt-Winters, State Space and ARIMA; for rainfall data, the options are Naive and ARIMA. |
| Length of forecast | A numeric input field to allow user key in the number of months to forecast |
| Data table, MAPE and plot of the forecasted results | After the user has chosen the forecasting method and indicated the length of forecast, we should display the data table of forecasted results, the MAPE and a plot of the forecasted results. |
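The parameter inputs in the table above map naturally onto standard Shiny widgets. Below is a hypothetical sketch of the rainfall sidebar; all input IDs, station names and choice labels are illustrative placeholders, not the final app code.

```r
library(shiny)

forecast_sidebar <- sidebarPanel(
  # Station list is illustrative; the app would read it from the data
  selectInput("station", "Weather station",
              choices = c("Admiralty", "Changi")),
  selectInput("impute_method", "Method of imputing missing data",
              choices = c("Moving average", "Kalman Smoothing (ARIMA)")),
  selectInput("model", "Forecasting model",
              choices = c("Naive", "ARIMA")),
  numericInput("horizon", "Number of months to forecast",
               value = 12, min = 1, max = 36)
)
```

`forecast_sidebar` can then be placed inside a `sidebarLayout()` on the rainfall page, with the plots and data tables rendered in the main panel.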
Based on the above considerations, we plan to have the following layout for the forecasting module in our Shiny Application. Below is an example for the forecasting of temperature data.

Another page with a similar layout would also be created for the forecasting of rainfall data. The sidebar would allow users to make their selections and the main panel would display the various plots and data tables using tabsets. The MAE, RMSE and MAPE results for the chosen forecasting model would also be displayed at the bottom of the page.
www.weather.gov.sg
https://michaelminn.net/tutorials/r-weather/index.html
https://towardsdatascience.com/a-guide-to-forecasting-in-r-6b0c9638c261
https://www.singstat.gov.sg/-/media/files/publications/reference/ssnsep05-pg11-14.ashx
https://otexts.com/fpp2/estimation-and-model-selection.html
https://www.michaelplazzer.com/weather-forecasting-with-machine-learning-in-r/
https://jtr13.github.io/EDAVold/missingTS.html
https://cran.r-project.org/web/packages/imputeTS/vignettes/imputeTS-Time-Series-Missing-Value-Imputation-in-R.pdf
https://www.analyticsvidhya.com/blog/2021/06/statistical-tests-to-check-stationarity-in-time-series-part-1/
https://hex.tech/blog/stationarity-in-time-series/
https://pkg.robjhyndman.com/forecast/reference/auto.arima.html